Fault-Tolerant Clusters of Workstations with Single System Image

Authors

  • Kai Hwang
  • Edward Chow
  • Cho-Li Wang
  • Hai Jin
  • Zhiwei Xu
Abstract

The computing trend is moving from clustering high-end mainframes to clustering desktop computers. This trend is driven by the widespread use of PCs, workstations, Gigabit networks, and middleware support for clustering. This paper presents new approaches to achieving fault tolerance and a single system image (SSI) in a workstation cluster. A multicomputer cluster is a collection of node computers that are physically connected by local area networks or by high-bandwidth switched networks using optical fibres. The workstations in the cluster can either work collectively as an integrated computing resource, that is, as an SSI, or operate separately as individual computers.
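The abstract describes SSI and fault tolerance only at a high level. As a toy illustration of the two operating modes it mentions, and not the authors' design, the following sketch models a cluster as a single submission point: callers hand jobs to "the cluster" rather than to a named workstation, and a job dispatched while a node is down is redirected to a live node. The classes `Node` and `SingleSystemImage`, the round-robin placement, and the boolean `alive` flag standing in for real failure detection are all hypothetical assumptions.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Node:
    """Hypothetical workstation; `alive` simulates failure detection."""
    name: str
    alive: bool = True
    completed: list = field(default_factory=list)

    def run(self, job: str) -> str:
        # A node can also be used directly, as an individual computer.
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.completed.append(job)
        return f"{job}@{self.name}"


class SingleSystemImage:
    """Toy SSI facade: callers submit to the cluster, never to a node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._rr = cycle(range(len(self.nodes)))  # round-robin placement (assumed policy)

    def submit(self, job: str) -> str:
        # Try each node at most once per job, skipping failed ones.
        for _ in range(len(self.nodes)):
            node = self.nodes[next(self._rr)]
            try:
                return node.run(job)
            except RuntimeError:
                continue  # fail over to the next node without involving the caller
        raise RuntimeError("no live nodes in cluster")


# Usage: three workstations, one of which has failed.
nodes = [Node("ws1"), Node("ws2"), Node("ws3")]
cluster = SingleSystemImage(nodes)
nodes[1].alive = False
results = [cluster.submit(f"job{i}") for i in range(4)]
print(results)
```

Here failover happens only at dispatch time; a job that is already running when its node fails would need a recovery mechanism such as checkpointing, which this sketch does not model.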

Similar resources

Fault tolerant system with imperfect coverage, reboot and server vacation

This study is concerned with the performance modeling of a fault-tolerant system consisting of operating units supported by a combination of warm and cold spares. The online as well as the warm standby units are subject to failures and are sent for repair to a repair facility with a single repairman who is prone to failure. If the failed unit is not detected, the system enters into an unsafe...

Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless chec...

Fault-tolerant Cluster Management for Reliable High-performance Computing

Clusters of COTS workstations/PCs are commonly used to implement cost-effective high-performance systems. A central coordinator/manager is often the simplest way to implement many of the operations required for managing these distributed systems. These operations include scheduling of parallel tasks, coordination of access to limited resources, as well as high-level coordination of fault tolera...

Development and Performance Analysis of a Fault Tolerant Algorithm for Cluster of Workstations

A Cluster of Workstations (COW) is a network-based multi-computer system and the most prominent distributed-memory system aimed at replacing supercomputers. A cluster of workstations can be viewed as a single machine in which one job is divided into n subtasks and delegated to n workstations in the COW architecture. To get the job completed, all subtasks assigned to component workstations mus...

DDG Task Recovery for Cluster Computing

This paper presents a solution to the problem of transparent recovery of asynchronous distributed computation on clusters of workstations when a fault occurs on a node. If the system has fault-tolerant features, it can survive the fault and continue its computations. Performance degradation is unavoidable when hardware redundancies are not available. It is a large advantage if the long-runtim...

Publication date: 1998